String kernels and similarity measures for information retrieval

Author

  • André Martins
Abstract

Measuring the similarity between two strings is a fundamental step in many applications in areas such as text classification and information retrieval. Recently, kernel-based methods have been proposed for this task, both for text and for biological sequences. Since kernels are inner products in a feature space, they naturally induce similarity measures. Information-theoretic approaches have also been the subject of recent research. The goal is to classify finite sequences without explicit knowledge of their statistical nature: sequences are considered similar if they are likely to have been generated by the same source. There is experimental evidence that relative entropy (although it is not a true metric) yields high accuracy in several classification tasks. Compression-based techniques, such as variations of the Ziv-Lempel algorithm for text, or GenCompress for biological sequences, have been used to estimate the relative entropy. Algorithmic concepts based on Kolmogorov complexity provide the theoretical background for these approaches. This paper describes some string kernels and information-theoretic methods. It evaluates the performance of both kinds of methods in text classification tasks, namely authorship attribution, language detection, and cross-language document matching.
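The two families of measures named in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation. The kernel shown is a p-spectrum kernel (an inner product of substring-count vectors), chosen here only as a simple example of a string kernel. The compression-based measure is the normalized compression distance, a related Kolmogorov-complexity-inspired dissimilarity, swapped in for the relative-entropy estimators the abstract mentions. The function names, the default `p = 3`, and the use of `zlib` (DEFLATE, which builds on a Ziv-Lempel scheme) are all assumptions made for illustration.

```python
import math
import zlib
from collections import Counter


def spectrum_features(s: str, p: int = 3) -> Counter:
    """Explicit feature map: counts of every contiguous length-p substring."""
    return Counter(s[i:i + p] for i in range(len(s) - p + 1))


def spectrum_kernel(s: str, t: str, p: int = 3) -> int:
    """Kernel value: inner product of the two substring-count vectors."""
    fs, ft = spectrum_features(s, p), spectrum_features(t, p)
    return sum(count * ft[gram] for gram, count in fs.items())


def kernel_similarity(s: str, t: str, p: int = 3) -> float:
    """Cosine-normalized kernel, so values lie between 0 and 1."""
    denom = math.sqrt(spectrum_kernel(s, s, p) * spectrum_kernel(t, t, p))
    return spectrum_kernel(s, t, p) / denom if denom else 0.0


def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed data, a crude stand-in for complexity."""
    return len(zlib.compress(data, 9))


def ncd(s: str, t: str) -> float:
    """Normalized compression distance (assumption: not the paper's estimator).

    Small values suggest the two strings share structure that the
    compressor can exploit when they are concatenated.
    """
    x, y = s.encode("utf-8"), t.encode("utf-8")
    cx, cy, cxy = compressed_size(x), compressed_size(y), compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

In a classification setting such as language detection, one plausible use is to compare an unlabeled text against a reference text per class and pick the class with the highest `kernel_similarity` or the lowest `ncd`. Very short strings give unreliable `ncd` values because compressor overhead dominates.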

Similar resources

Kernels and Similarity Measures for Text Classification

Measuring similarity between two strings is a fundamental step in text classification and other problems of information retrieval. Recently, kernel-based methods have been proposed for this task; since kernels are inner products in a feature space, they naturally induce similarity measures. Information theoretic (dis)similarities have also been the subject of recent research. This paper describ...

Harry: A Tool for Measuring String Similarity

Comparing strings and assessing their similarity is a basic operation in many application domains of machine learning, such as in information retrieval, natural language processing and bioinformatics. The practitioner can choose from a large variety of available similarity measures for this task, each emphasizing different aspects of the string data. In this article, we present Harry, a small t...

String Metrics and Word Similarity applied to Information Retrieval

Over the past three decades, Information Retrieval (IR) has been studied extensively. The purpose of information retrieval is to assist users in locating the information they are looking for. Information retrieval is currently being applied in a variety of application domains, from database systems to web search engines. Its main idea is to locate documents that contain terms the u...

String Re-writing Kernel

Learning for sentence re-writing is a fundamental task in natural language processing and information retrieval. In this paper, we propose a new class of kernel functions, referred to as string re-writing kernel, to address the problem. A string re-writing kernel measures the similarity between two pairs of strings, each pair representing re-writing of a string. It can capture the lexical and s...

An Introduction to String Re-Writing Kernel

Learning for sentence re-writing is a fundamental task in natural language processing and information retrieval. In this paper, we propose a new class of kernel functions, referred to as string re-writing kernel, to address the problem. A string re-writing kernel measures the similarity between two pairs of strings. It can capture the lexical and structural similarity between sentence pairs with...

Publication date: 2006